Data Quality

Data quality has gained paramount importance because businesses today rely on data for decision making. Poor-quality data leads to flawed reports, which in turn lead to misguided conclusions, increased operational costs, and problems for downstream consumers of the data. With data coming from disparate sources, and with no control over the quality of that source data, it becomes necessary to transform and cleanse the data to make it suitable for analysis. Data used for analytical purposes must provide accurate inputs; inaccurate data leads to incorrect results, and flawed insights into business-critical decisions can cost companies time, money, and resources.

Data quality provides an aggregate score of the overall quality of the data, expressed as a percentage rating that shows how accurate the data is.

What is Data Quality?

Some important elements that define the quality of your dataset include the following (a short code sketch illustrating such checks follows the list):

  • Accuracy - Checks the degree to which data conforms to a defined standard. For example, if a date is required to be in the mm:dd:yy format but is instead recorded in the dd:mm:yy format, the data is inaccurate.

  • Completeness - Checks whether the data has all the required values, with no missing information. For example, a complete address record for an employee ensures that the employee is reachable.

  • Consistency - Checks whether information that is stored and used in multiple places matches. For example, if some phone records store the international code in a separate field while others prefix the number with it, the data needs to be made consistent.

  • Uniqueness - Checks whether a record has only a single instance in a dataset.

  • Validity - Checks the degree to which data conforms to business rules or definitions. It also checks whether data follows the accepted formats and whether values fall within the expected range.
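
The dimensions above are typically measured programmatically. The following is a minimal PySpark sketch, assuming a hypothetical employee dataset with address, employee_id, and join_date columns, of how completeness, uniqueness, and validity scores could be computed. It illustrates the concepts only and is not the implementation used by Data Pipeline Studio.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-dimension-checks").getOrCreate()

    # Hypothetical employee dataset in an S3 data lake
    df = spark.read.parquet("s3://your-bucket/employees/")
    total = df.count()

    # Completeness: share of rows where the address is populated
    completeness = df.filter(F.col("address").isNotNull()).count() / total

    # Uniqueness: share of employee_id values that occur exactly once
    uniqueness = (
        df.groupBy("employee_id").count().filter(F.col("count") == 1).count() / total
    )

    # Validity: share of join_date values that match the expected mm:dd:yy pattern
    validity = df.filter(F.col("join_date").rlike(r"^\d{2}:\d{2}:\d{2}$")).count() / total

    print(f"completeness={completeness:.2%} uniqueness={uniqueness:.2%} validity={validity:.2%}")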

Data Quality in Data Pipeline Studio

You can perform data quality checks using the Databricks or Snowflake data quality capabilities on a dataset available in an Amazon S3 data lake or a Snowflake data lake. The technology you use depends entirely on your organizational preference. If you are using a Snowflake data lake, you must use the Snowflake data quality capabilities. In the case of an Amazon S3 data lake, you can use Databricks for data quality. Data Pipeline Studio lets you add the following data quality stages to your pipeline to enhance the quality of your data:

  • Data Profiler - This stage runs an analysis on a sample of the data against selected parameters such as completeness, validity, and character count, and gives you statistics for the data on those parameters (see the sketch after this item). If you have used validity as a constraint in the data profiler job, you can create a validator job once the profiler output is available. The validator results provide more information about the data patterns in the selected columns.

    Databricks Data Profiler

    Snowflake Data Profiler
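
    For illustration only, the following PySpark sketch shows the kind of profiling this stage performs: it samples a dataset and computes statistics for a few parameters such as completeness, validity, and character count. The S3 path and column names (phone, email, name) are hypothetical, and this is not the product's internal implementation.

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("profiler-sketch").getOrCreate()

      # Profile a small sample rather than the full dataset
      sample = spark.read.parquet("s3://your-bucket/customers/").sample(fraction=0.05, seed=42)

      profile = sample.select(
          F.count(F.lit(1)).alias("rows_sampled"),
          # Completeness: fraction of rows where phone is populated
          (F.count("phone") / F.count(F.lit(1))).alias("phone_completeness"),
          # Validity: fraction of non-null emails matching a simple pattern
          F.avg(F.col("email").rlike(r"^[^@]+@[^@]+\.[^@]+$").cast("int")).alias("email_validity"),
          # Character count statistics for a string column
          F.min(F.length("name")).alias("name_min_len"),
          F.max(F.length("name")).alias("name_max_len"),
      )
      profile.show(truncate=False)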

  • Data Analyzer - In this stage, an analysis is performed on the complete dataset based on the selected constraints. This job runs in two parts (see the sketch after this item).

    • First you create and run a data analyzer job by adding the required constraints.

    • Once the job run is complete, you create a data validator job. In this job, you can include the constraints from the data analyzer job as well as add new constraints as required.

    Databricks Data Analyzer

    Snowflake Data Analyzer
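
    As an illustration of this two-part flow, the following PySpark sketch first computes analyzer-style metrics over the complete dataset and then runs validator-style checks against those metrics, including one newly added constraint. The table and column names are hypothetical, and the sketch is not the product's implementation.

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("analyzer-sketch").getOrCreate()
      df = spark.read.table("sales.orders")  # complete dataset, not a sample

      # Part 1: analyzer-style metrics for the chosen constraints
      metrics = df.agg(
          F.count(F.lit(1)).alias("row_count"),
          (F.count("order_id") / F.count(F.lit(1))).alias("order_id_completeness"),
          F.countDistinct("order_id").alias("order_id_distinct"),
          F.min("amount").alias("amount_min"),
      ).first()

      # Part 2: validator-style checks against the computed metrics
      checks = {
          "order_id is complete": metrics["order_id_completeness"] == 1.0,
          "order_id is unique": metrics["order_id_distinct"] == metrics["row_count"],
          "amount is non-negative": metrics["amount_min"] >= 0,  # newly added constraint
      }
      for name, passed in checks.items():
          print(f"{name}: {'PASS' if passed else 'FAIL'}")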

  • Issue Resolver - In this stage, you further enhance the quality of the data by resolving some of the data-related issues that were found. You do this by running the data through various constraints, in the following ways (a sketch follows the list):

    • Handling duplicate data

    • Handling missing data

    • Handling outliers

    • Specifying the partitioning order

    • Handling string operations

    • Handling case sensitivity

    • Replacing selective data

    • Handling data against a master table

    Databricks Issue Resolver

    Snowflake Issue Resolver
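
    The following PySpark sketch is a rough illustration of how some of these resolution steps could be expressed: removing duplicates, handling missing values, capping outliers, normalizing strings and case, replacing selective values, and checking data against a master table. All table names, column names, and thresholds are hypothetical, and this is not the product's implementation.

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("resolver-sketch").getOrCreate()
      df = spark.read.table("sales.orders_raw")

      # Handle duplicate data: keep one row per business key
      df = df.dropDuplicates(["order_id"])

      # Handle missing data: fill defaults, then drop rows missing required fields
      df = df.na.fill({"discount": 0.0}).na.drop(subset=["order_id", "amount"])

      # Handle outliers: cap amounts at an approximate 99th percentile
      p99 = df.approxQuantile("amount", [0.99], 0.01)[0]
      df = df.withColumn("amount", F.least(F.col("amount"), F.lit(p99)))

      # String operations and case handling: trim whitespace, normalize case
      df = df.withColumn("country", F.upper(F.trim(F.col("country"))))

      # Replace selective data: map a known placeholder to null
      df = df.replace({"N/A": None}, subset=["region"])

      # Check data against a master table: keep only rows whose customer exists
      master = spark.read.table("reference.customers")
      df = df.join(master.select("customer_id"), on="customer_id", how="left_semi")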

Are any data quality stages mandatory?

The simple answer is no. The data quality stages mentioned above are optional, and none of them is a prerequisite for the next. You can use each stage independently as long as you have a valid use case and the required data for that stage. For example, you can profile the data to gain a better understanding of your dataset. Alternatively, you can skip this stage and use the data analyzer directly to perform a complete analysis of your dataset. If your data has already been analyzed, you can use the issue resolver stage directly to resolve some of the issues with that data.


What's next? Databricks Data Profiler